In this post I'd like to dive into NFL yards-per-play (ypp) outcomes, looking at ypp distributions under different conditions. I'll be using NLF 2009-2015 play-by-play data that I've downloaded using the awesome R library nflscrapR
For the R code used to query data using nflScrapR, or for an overview of the variables available with it, see [my previous blog post link]
In [481]:
#Load libraries
library(ggplot2)
library(sqldf)
library(dplyr)
library(scales)
library(assertthat)
#Import data
pbp_data <- read.csv('/home/max/nfl_stats/data/pbp_2009_2015.csv')
#Adjust default height of ggplot charts
options(repr.plot.height = 4)
Now let's filter the data a bit to keep only the rows/columns we're interested in. Also, let's make sure we count sacks as pass attempts.
In [482]:
#filter play-by-play data to just relevant rows/columns
#keep_columns = c('PlayType', 'GameID', 'Season', 'Drive', 'qtr', 'down', 'desc', 'yrdline100', 'ydstogo', 'posteam', 'DefensiveTeam', 'Yards.Gained', 'Touchdown', 'Safety', 'Fumble', 'InterceptionThrown', 'PenaltyType', 'Penalty.Yards', 'PenalizedTeam')
pbp_data_2 = subset(pbp_data, PlayType %in% c('Pass', 'Run', 'Sack') & down %in% c(1,2,3,4))#, select=keep_columns)
#Remove original pbp_data from memory so I don't accidentally reference it
rm(pbp_data)
#Sacks are pass attempts
pbp_data_2$PlayType[pbp_data_2$PlayType=='Sack'] <- 'Pass'
On average, an NFL play nets 5.5 yards. What does the distribution of outcomes look like?
In [191]:
ggplot(pbp_data_2, aes(x=Yards.Gained)) +
geom_histogram(binwidth=1)
What stands out here is the huge spike at 0 Yards.Gained. This makes sense, given that about 35-40% of all pass-plays are incomplete passes and so gain zero yards.
Let's break out the yards-per-play distributions by run vs pass plays. We can do this by adding fill=PlayType below. We'll also add y=..density.. so that the y-axis is formatted proportionally.
In [192]:
plot <- ggplot(pbp_data_2, aes(x=Yards.Gained, y=..density..)) +
geom_histogram(aes(fill=PlayType), binwidth=1, alpha=0.5, position="identity")
plot
This to me is a really interesting plot. One things to note: the bars for each PlayType sum to 1. So the spike for Pass plays at 0 shows that about 37% of all pass plays net 0-yards, not that 37% of all plays are 0-yard pass plays.
Before we dive deeper into slicing yards-per-play distributions, we should do something about the fact that our plot above is mostly empty space.
Cropping out the wasted horizontal space is straightforward. For the vertical space, we have a single vertical bar which we expect to extend way past all others in multiple variations of this chart. In this case, I think the best approach is to crop that as well, and leave an annotated note with how high it extends.
In [448]:
#Update plot with new/cropped x/y-axis ranges
plot <- plot + coord_cartesian(ylim = c(0, .15), xlim = c(-10, 40))
#Get internal data underlying plot-image as dataframe, from ggbuild_data(). For use in annotation
ggbuild_data <- as.data.frame(ggplot_build(plot)$data)
#Plot with annotation for height of truncated bar
annotated_plot <- plot + geom_label(data=ggbuild_data,
aes(x = 4, hjust=0, y = .14, label = paste0("0-Yard Pass: ",
round(max(density), digits=2))))
annotated_plot
This plot is much clearer, and looks much more amenable to comparing multiple versions side-by-side.
There's two plotting decision embedded in the code which is worth making clear: First, the plot above "zooms in" on the range of the plot we're interested in, but keeps the bar heights as the calculated proportions based on all the data. ggplot did this because we used xlim
& ylim
inside of coord_cartesian()
. If we had wanted the probability distribution itself to be re-calculated based on just the data within our narrowed window, we could have used xlim
& ylim
as ggplot options outside of coord_cartesian()
. That would filter the data-set before calculating the densities for each Yards.Gained value. See this Stack Overflow answer
Second, the .37 label/calculation is actually just taking the maximum height of any bar, which I assume will be 0-yard Pass for any slicing of this data given the prevalence of incomplete passes. But if I produce some variation of this plot where another bar goes off the screen, I'd need to
Creating grids of sub-plots with ggplot is really convenient, although you have to be careful when you want ggplot to calculate summary stats for you based on the faceting variable.
The code below shows how simple it is to break out probability distributions by categories, but also suggests to us that the max(..density..) calculation is not being grouped by the faceting variable "down". We're getting the same .37 proportion in each subplot that we know from above is the global average. For some addiitional discussion of this behavior, see these Stack Overflow threads here and here.
tl;dr - the labeled stats in the plot below are wrong
In [450]:
annotated_plot +
coord_cartesian(ylim=c(0,.25), xlim=c(-10, 40)) + #(extend y-axis up to .25 from .15 in last plot)
facet_wrap(~down, labeller = label_both) #(create subplots for each 'down' group)
To get the labeled stats to show up correctly by facet, we can:
In the code block below, the code comments highlight where steps 1/2/3 start.
In [195]:
#1. Create Previous 'plot' Histogram, Faceted on "Down" for Subplots, without stat labels
faceted_plot <- ggplot(pbp_data_2, aes(x=Yards.Gained, y=..density..)) +
geom_histogram(aes(fill=PlayType), binwidth=1, alpha=0.5, position="identity") +
coord_cartesian(ylim = c(0, .25), xlim = c(-10, 40)) +
facet_wrap(~down, labeller = label_both)
#2. Calculate the height of the '0-yard Pass' bar for each subplot
#ggplot_build()$data has most data underlying the plot image, including densities (bar heights)
ggbuild_data <- as.data.frame(ggplot_build(faceted_plot)$data)
#ggplot_build()$layout$panel_layout maps levels of PANEL(subplot) to levels of the faceting variable
ggbuild_panels <-ggplot_build(faceted_plot)$layout$panel_layout
#Merge two ggplot_build() datasets above so we can summarize ..density..'s by the faceting variable
ggbuild_info <- merge(ggbuild_data, ggbuild_panels[, c('PANEL', 'down')], by='PANEL', all.x = TRUE) #all.x=TRUE means left-join
#Throw error if merge somehow changed number of rows
assert_that(nrow(ggbuild_data) == nrow(ggbuild_info))
#Get Summary Stat by Panel/Down
max_bin_by_panel = sqldf("select PANEL, down, max(density) as max_density
from ggbuild_info
group by PANEL, down
")
#3. Add stat labels calculated (correctly) by down to faceted plot
labeled_faceted_plot <- faceted_plot + geom_label(data=max_bin_by_panel,
aes(x=17, y=0.23, label= paste0('0-Yard Passes:', round(max_density,2))))
labeled_faceted_plot
This is another interesting plot. A few things that stand out:
I'd like to reproduce the chart above but for some different groupings as well. To make this handy, we can pull out the grouping (facet) variable and some of the plot aesthetics from the code above, so that we can easily reference the previous code for new faceting variables with just a one-line function call.
In [196]:
#Define function to create labeled/faceted plot which accepts a grouping variable and certain plotting aesthetics
plot_nfl_yards <- function(dataframe=pbp_data_2, group_by_var=down,
facet_cols=2, coord_xlim=c(-10, 40), coord_ylim = c(0,.25)) {
#make sure "~[var]" in facet_wrap is understood by ggplot
facet_expression <- as.formula(paste("~", group_by_var))
#1. Create Yard.Gained Histogram, Colored by Pass/Run, Faceted on [group_by_var] for Subplots
faceted_plot <- ggplot(dataframe, aes(x=Yards.Gained, y=..density..)) +
geom_histogram(aes(fill=PlayType), binwidth=1, alpha=0.5, position="identity") +
coord_cartesian(ylim = coord_ylim, xlim = coord_xlim) +
facet_wrap(facet_expression, ncol=facet_cols, labeller = label_both)
#2. Calculate the height of the '0-yard Pass' bar for each subplot
#ggplot_build()$data has most data underlying the plot image, including densities (bar heights)
ggbuild_data <- as.data.frame(ggplot_build(faceted_plot)$data)
#ggplot_build()$layout$panel_layout maps levels of PANEL(subplot) to levels of the faceting variable
ggbuild_panels <-ggplot_build(faceted_plot)$layout$panel_layout
#Merge two ggplot_build() datasets above so we can summarize ..density..'s by the faceting variable
ggbuild_info <- merge(ggbuild_data, ggbuild_panels[, c('PANEL', group_by_var)], by='PANEL', all.x = TRUE) #all.x=TRUE means left-join
#Throw error if merge somehow changed number of rows
assert_that(nrow(ggbuild_data) == nrow(ggbuild_info))
#Get Summary Stat by Panel/Group_by_var
max_bin_by_panel = sqldf(paste("select PANEL, ", group_by_var, ", max(density) as max_density
from ggbuild_info
group by PANEL, ", group_by_var)
)
#3. Add stat labels calculated (correctly) by down to faceted plot
labeled_faceted_plot <- faceted_plot + geom_label(data=max_bin_by_panel,
aes(x=.6*max(coord_xlim), y=.9*max(coord_ylim), label= paste0('Tallest Bar:', round(max_density,2))))
return(labeled_faceted_plot)
}
Ok, first let's confirm we can use this function to get the same plot we had above, faceted by down.
In [197]:
plot_nfl_yards(group_by_var = 'down')
Cool. The only change is that I changed the label to say "Tallest Bar", since for certain slices of the data it might not be be 0-yard Pass attempts that we're truncating (if any). Let's look at some other ways of slicing the data.
In [199]:
plot_nfl_yards(group_by_var = 'Season')
Hm, that's not so interesting. Everything looks pretty consistent by year. To the extent there may be interesting changes in yards-per-play outcomes year-to-year, this probably isn't the best plot to highlight them.
How about looking at goal-to-go situations (no chance of gaining a first down short of the end-zone)?
In [201]:
plot_nfl_yards(group_by_var = "GoalToGo")
Interesting. So 0-yard pass attempts go from .36 to .48 when we enter GoalToGo situations. Since 0-yard pass attempts are mainly incomplete passes, we can interpret this as "pass completion percentage is about 64% when not in Goal-to-Go situations, and down to about 52% in Goal-to-Go situations."
In [204]:
plot_nfl_yards(group_by_var = 'Touchdown')
Another interesting plot. First thing that stands out to me: Touchdown plays with negative Yards.Gained?
In [217]:
td_but_neg_yards <- subset(pbp_data_2, Touchdown==1 & Yards.Gained < 0, select = c('desc', 'Touchdown', 'yrdline100', 'Yards.Gained', 'PlayType'))
head(td_but_neg_yards, 3)
Defensive Touchdowns - makes sense. And Jason Campbell makes an appearance! Sorry, JC.
The second thing that stands out is that for the first time, [0-yard pass] is not our tallest bar. Among TD-scoring plays, '1-yard Run' is our tallest bar.
This might seem to suggest that 1-yard Runs are the most common TD-scoring plays, but that would be a misinterpretion of this plot. Each color-group in each subplot sums to 1, so that .32 in the right panel means that of Run Plays that score a TD, 32% are 1-yard run plays. If we want to look at the probability of scoring a TD by running vs passing when close to the end zone, we want something more like this...
Third-down and Goal-to-Go...a Touchdown earns 7 points, and anything else is likely just 3 points. How do runs vs passes do, controlling for yard-line position?
In [523]:
third_and_goal_inside4 <- subset( pbp_data_2, yrdline100 <= 4 & down == 3 & GoalToGo == 1)
ggplot(third_and_goal_inside4) +
geom_histogram(aes(x=Touchdown, y=..density.., fill=PlayType), binwidth=1, alpha=0.5, position="identity") +
scale_x_continuous(breaks = c(0,1)) + #(scale continuous (0,1) is a hack because hist wont accept scale_discrete direclty or x=as.factor())
facet_wrap(down~yrdline100, ncol=5, labeller = label_both)
So on third-down from the 1-yardline, about 59% of run-plays gain a Touchdown while only about 53% of Pass attempts do. And on third-down from the 2-,3-, or 4-yardline, touchdown rates look even closer.
Honestly what's most interesting here to me is how even the Run vs Pass success rates are. The divergent trend which is hinted at but kind of obscured here, which I am now interested in extrapolating further, is how the TD-rate decreases as the yards away from the end-zone increases.
In [512]:
#'Touchdown' is always 0 or 1. Mean gives avg rate of 1, by group-by var(s)
plot_data <- pbp_data_2 %>%
filter(yrdline100 <= 10 & down==3 & GoalToGo==1) %>%
group_by(yrdline100) %>%
summarise(td_success_rate = mean(Touchdown))
#Plot
ggplot(plot_data, aes(x=yrdline100, y=td_success_rate)) +
geom_line() +
scale_x_continuous(breaks = seq(0,10, by=1)) +
scale_y_continuous(breaks = seq(0,.7, by = .1), limits=c(0,.7))
We can also modify this code slightly to plot different lines by down.
In [522]:
#Like above, except: 1) Do not filter on [down=3] and 2) Group/Summarize by (down, yrdline100), not just yrdline100
plot_data_2 <- pbp_data_2 %>%
filter(yrdline100 <= 10 & GoalToGo==1) %>%
group_by(down, yrdline100) %>%
summarise(td_success_rate = mean(Touchdown))
#Plot
ggplot(plot_data_2, aes(x=yrdline100, y=td_success_rate, colour=factor(down))) +
geom_line() +
scale_x_continuous(breaks = seq(0,10, by=1)) +
scale_y_continuous(breaks = seq(0,.7, by = .1), limits=c(0,.7))